STA 101L - Summer I 2022
Raphael Morsomme
simple linear regression model \[ Y \approx \beta_0 + \beta_1 X \]
multiple linear regression model \[ Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p \]
categorical predictor
feature engineering
Earlier, we saw that adding the predictor \(\dfrac{1}{\text{displ}}\) gave a better fit.
Let us see if the same idea works with the trees dataset.
Suppose we want to predict volume using only the variable girth.
One could argue that there is a slight nonlinear association
We start with the simple model
\[ \text{Volume} \approx \beta_0 + \beta_1 \text{girth} \]
Girth_new <- seq(min(d_tree$Girth), max(d_tree$Girth), by = 0.001)
new_data <- tibble(Girth = Girth_new)
Volume_pred <- predict(m1, new_data)
new_data <- mutate(new_data, Volume_pred = Volume_pred)
head(new_data)
# A tibble: 6 x 2
Girth Volume_pred
<dbl> <dbl>
1 8.3 5.10
2 8.30 5.11
3 8.30 5.11
4 8.30 5.12
5 8.30 5.12
6 8.31 5.13
To capture the nonlinear association between Girth and Volume, we consider the predictor \(\text{Girth}^2\).
\[ \text{Volume} \approx \beta_0 + \beta_1 \text{girth} + \beta_2 \text{girth}^2 \]
The command to fit this model is
d_tree2 <- mutate(d_tree, Girth2 = Girth^2)
m2 <- lm(Volume ~ Girth + Girth2, data = d_tree2)
compute_R2(m2)
[1] 0.962
\(R^2\) has increased! It went from 0.935 (model 1) to 0.962.
Girth_new <- seq(min(d_tree$Girth), max(d_tree$Girth), by = 0.001)
new_data <- tibble(Girth = Girth_new) %>% mutate(Girth2 = Girth^2)
Volume_pred <- predict(m2, new_data)
new_data <- mutate(new_data, Volume_pred = Volume_pred)
head(new_data)
# A tibble: 6 x 3
Girth Girth2 Volume_pred
<dbl> <dbl> <dbl>
1 8.3 68.9 11.0
2 8.30 68.9 11.0
3 8.30 68.9 11.0
4 8.30 68.9 11.0
5 8.30 69.0 11.0
6 8.31 69.0 11.0
Let us consider the predictor \(\text{Girth}^3\).
\[ \text{Volume} \approx \beta_0 + \beta_1 \text{girth} + \beta_2 \text{girth}^2 + \beta_3 \text{girth}^3 \]
d_tree3 <- mutate(d_tree, Girth2 = Girth^2, Girth3 = Girth^3)
m3 <- lm(Volume ~ Girth + Girth2 + Girth3, data = d_tree3)
compute_R2(m3)
[1] 0.963
\(R^2\) has increased! It went from 0.962 (model 2) to 0.963.
Girth_new <- seq(min(d_tree$Girth), max(d_tree$Girth), by = 0.001)
new_data <- tibble(Girth = Girth_new) %>% mutate(Girth2 = Girth^2, Girth3 = Girth^3)
Volume_pred <- predict(m3, new_data)
new_data <- mutate(new_data, Volume_pred = Volume_pred)
head(new_data)
# A tibble: 6 x 4
Girth Girth2 Girth3 Volume_pred
<dbl> <dbl> <dbl> <dbl>
1 8.3 68.9 572. 9.88
2 8.30 68.9 572. 9.88
3 8.30 68.9 572. 9.89
4 8.30 68.9 572. 9.89
5 8.30 69.0 573. 9.89
6 8.31 69.0 573. 9.90
What if we also include the predictors \(\text{Girth}^4, \text{Girth}^5, \dots, \text{Girth}^{k}\) for some larger number \(k\)?
Consider such a model with \(k=34\), that is,
\[ \text{Volume} \approx \beta_0 + \beta_1 \text{girth} + \beta_2 \text{girth}^2 + \dots + \beta_{34} \text{girth}^{34} \]
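The fitting code is not shown on the slide; the model can be reproduced with poly() and raw powers, as in the later slides. A sketch, assuming d_tree is R's built-in trees data and reconstructing the course helper compute_R2 as \(1 - SSR/SST\):

```r
# Assumptions: d_tree is the built-in `trees` data; compute_R2 is a course
# helper, reconstructed here as R^2 = 1 - SSR/SST.
d_tree <- datasets::trees

compute_R2 <- function(model) {
  y   <- model.frame(model)[[1]]      # response values
  SSR <- sum(residuals(model)^2)      # sum of squared residuals
  SST <- sum((y - mean(y))^2)         # total sum of squares
  round(1 - SSR / SST, 3)
}

# Degree-34 polynomial in Girth (raw powers; lm() drops aliased terms)
m_extreme <- lm(Volume ~ poly(Girth, degree = 34, raw = TRUE), data = d_tree)
compute_R2(m_extreme)
```

With only 31 trees, many of the 34 power terms are numerically redundant, so lm() reports NA for the aliased coefficients; the fitted values are still defined.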
compute_R2(m_extreme)
[1] 0.98
\(R^2\) has again increased! It went from 0.963 (model 3) to 0.98.
As we keep adding predictors, \(R^2\) will always increase.
Is the model m_extreme a good model?
Note
We want to learn about the relation between Volume and Girth present in the population (not the sample).
A good model should accurately represent the population.
Girth_new <- seq(min(d_tree$Girth), max(d_tree$Girth), by = 0.00001)
new_data <- tibble(Girth = Girth_new)
Volume_pred <- predict(m_extreme, new_data)
new_data <- mutate(new_data, Volume_pred = Volume_pred)
head(new_data)
# A tibble: 6 x 2
Girth Volume_pred
<dbl> <dbl>
1 8.3 10.3
2 8.30 10.3
3 8.30 10.3
4 8.30 10.3
5 8.30 10.3
6 8.30 10.3
The model m_extreme overfits the data.
A model overfits the data when it fits the sample extremely well but does a poor job for new data.
Remember that we want to learn about the population, not the sample!
We saw that \(R^2\) keeps increasing as we add predictors.
\[ R^2 = 1 - \dfrac{SSR}{SST} \]
\(R^2\) can therefore not be used to identify models that overfit the data.
The adjusted-\(R^2\) is similar to \(R^2\), but penalizes large models:
\[ R^2_{\text{adj}} = 1 - \dfrac{SSR}{SST}\dfrac{n-1}{n-p-1} \]
where \(p\) corresponds to the number of predictors.
The adjusted-\(R^2\) therefore balances goodness of fit and parsimony:
The model with the highest adjusted-\(R^2\) typically provides a good fit without overfitting.
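As a sanity check on the formula, the adjusted-\(R^2\) can be computed by hand and compared with the value that summary() reports; a sketch using the built-in trees data:

```r
# Verify the adjusted-R^2 formula against the value reported by summary()
d_tree <- datasets::trees
m1 <- lm(Volume ~ Girth, data = d_tree)

n   <- nrow(d_tree)   # sample size
p   <- 1              # number of predictors
SSR <- sum(residuals(m1)^2)
SST <- sum((d_tree$Volume - mean(d_tree$Volume))^2)

R2_adj <- 1 - (SSR / SST) * (n - 1) / (n - p - 1)
all.equal(R2_adj, summary(m1)$adj.r.squared)  # TRUE
```

Since \((n-1)/(n-p-1) > 1\) whenever \(p \geq 1\), the adjusted-\(R^2\) is always below the plain \(R^2\), and the gap grows with \(p\).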
Two other popular criteria that balance goodness of fit (small SSR) and parsimony (small \(p\)) are
The formulas for the AIC and BIC are, respectively,
\[ AIC = 2p - \text{GoF}, \qquad BIC = \ln(n)\,p - \text{GoF} \]
where \(\text{GoF}\) measures the goodness of fit of the model.
Unlike the adjusted-\(R^2\), smaller is better for the AIC and BIC.
Note that the BIC penalizes the number of parameters \(p\) more strongly.
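In R, both criteria can be recovered from the log-likelihood (here \(\text{GoF} = 2\ln\hat{L}\)); note that R's AIC() counts every estimated parameter, intercept and residual variance included, so its parameter count differs from the slide's predictor count \(p\). A sketch:

```r
# AIC and BIC from the log-likelihood: GoF = 2 * log-likelihood
d_tree <- datasets::trees
m1 <- lm(Volume ~ Girth, data = d_tree)

ll <- logLik(m1)        # maximized log-likelihood
k  <- attr(ll, "df")    # parameters counted by R: intercept, slope, sigma
n  <- nobs(m1)

aic <- 2 * k - 2 * as.numeric(ll)
bic <- log(n) * k - 2 * as.numeric(ll)
c(AIC = aic, BIC = bic)
```

Because \(\ln(31) \approx 3.4 > 2\), the BIC penalty per parameter exceeds the AIC's here, illustrating the stronger penalization mentioned above.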
These criteria are easily accessible in R with the glance command (from the broom package). For instance,
glance(m1)
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.935 0.933 4.25 419. 8.64e-19 1 -87.8 182. 186.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(m2)
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.962 0.959 3.33 350. 1.52e-20 2 -79.7 167. 173.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(m3)
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.963 0.959 3.35 232. 2.17e-19 3 -79.3 169. 176.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(m_extreme)
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.980 0.960 3.30 48.6 5.64e-10 15 -69.7 173. 198.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
In this case, the AIC and BIC favor m2, while the adjusted-\(R^2\) (wrongly) favors m_extreme.
The adjusted-\(R^2\), AIC and BIC all try to balance goodness of fit and parsimony.
They achieve this balance by favoring models with small SSR while penalizing models with large \(p\).
But the adjusted-\(R^2\) still favored m_extreme, although it was clearly overfitting the data.
Instead of using these criteria, we could look for the model with the best prediction performance.
That is, the model whose predictions for new observations are closest to the true values.
We will learn two approaches to accomplish this
The holdout method is a simple method to evaluate the predictive performance of a model.
Note that the test set consists of new observations for the model.
\(\Rightarrow\) Select the model with the best prediction accuracy in step 3.
The holdout method.
Source: IMS
The following R function splits a sample into a training and a test set
construct_training_test <- function(sample, prop_training = 2/3){
  n <- nrow(sample)
  n_training <- round(n * prop_training)
  sample_random <- slice_sample(sample, n = n)           # shuffle the rows
  sample_training <- slice(sample_random, 1:n_training)
  sample_test <- slice(sample_random, -(1:n_training))
  return(list(
    training = sample_training, test = sample_test
  ))
}
   Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
4 10.5 72 16.4
5 10.7 81 18.8
6 10.8 83 19.7
7 11.0 66 15.6
8 11.0 75 18.2
9 11.1 80 22.6
10 11.2 75 19.9
11 11.3 79 24.2
12 11.4 76 21.0
13 11.4 76 21.4
14 11.7 69 21.3
15 12.0 75 19.1
16 12.9 74 22.2
17 12.9 85 33.8
18 13.3 86 27.4
19 13.7 71 25.7
20 13.8 64 24.9
21 14.0 78 34.5
22 14.2 80 31.7
23 14.5 74 36.3
24 16.0 72 38.3
25 16.3 77 42.6
26 17.3 81 55.4
27 17.5 82 55.7
28 17.9 80 58.3
29 18.0 80 51.5
30 18.0 80 51.0
31 20.6 87 77.0
set.seed(0)
training_test_sets <- construct_training_test(d_tree)
training_set <- training_test_sets[["training"]]
training_set
   Girth Height Volume
1 11.7 69 21.3
2 16.3 77 42.6
3 10.5 72 16.4
4 11.0 66 15.6
5 8.3 70 10.3
6 8.6 65 10.3
7 14.5 74 36.3
8 11.3 79 24.2
9 20.6 87 77.0
10 13.3 86 27.4
11 13.7 71 25.7
12 17.5 82 55.7
13 11.2 75 19.9
14 18.0 80 51.0
15 14.0 78 34.5
16 17.9 80 58.3
17 11.1 80 22.6
18 10.7 81 18.8
19 14.2 80 31.7
20 12.0 75 19.1
21 11.4 76 21.0
We simply fit our regression model to the training set.
To evaluate the prediction accuracy, we start by computing the predictions for the observations in the test set.
A good model will make predictions that are close to the true values of the response variable.
A good measure of prediction accuracy is the sum of squared errors (SSE).
\[ SSE = (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \dots + (y_m - \hat{y}_m)^2 \]
Small SSE is better.
In practice, the (root) mean squared error ((R)MSE) is often used.
\[ MSE = \dfrac{SSE}{m} = \dfrac{(y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \dots + (y_m - \hat{y}_m)^2}{m} \]
\[ RMSE = \sqrt{\dfrac{SSE}{m}} = \sqrt{\dfrac{(y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \dots + (y_m - \hat{y}_m)^2}{m}} \]
Again, small (R)MSE is better.
SSE <- sum((Volume - Volume_hat)^2)
m <- nrow(test_set)
MSE <- SSE / m
RMSE <- sqrt(MSE)
SSE; MSE; RMSE
[1] 225.2413
[1] 22.52413
[1] 4.745959
Tip
The advantage of the (R)MSE over the SSE is that we can compare models evaluated with test sets of different sizes.
Apply steps 2 and 3 on different models.
Simply choose the model with the lowest (R)MSE.
This model has the best out-of-sample accuracy.
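The course helper compute_RMSE is used below without being defined on the slides; a plausible reconstruction, consistent with how it is called (test set first, model second) and assuming the response variable is Volume:

```r
# Plausible reconstruction of the course helper compute_RMSE(test_set, model):
# RMSE of the model's predictions on a held-out test set
compute_RMSE <- function(test_set, model) {
  y_hat <- predict(model, newdata = test_set)  # predictions for the test set
  sqrt(mean((test_set$Volume - y_hat)^2))      # root mean squared error
}
```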
# Step 2
m1 <- lm(Volume ~ poly(Girth, degree = 1 , raw = TRUE), data = training_set)
m2 <- lm(Volume ~ poly(Girth, degree = 2 , raw = TRUE), data = training_set)
m3 <- lm(Volume ~ poly(Girth, degree = 3 , raw = TRUE), data = training_set)
m48 <- lm(Volume ~ poly(Girth, degree = 48, raw = TRUE), data = training_set)
# Step 3
compute_RMSE(test_set, m1)
[1] 4.745959
compute_RMSE(test_set, m2)
[1] 4.16503
compute_RMSE(test_set, m3)
[1] 4.093268
compute_RMSE(test_set, m48)
[1] 13.47366
We select m3.
The previous analysis was conducted with set.seed(0).
Models m2 and m3 were pretty close.
Could we have obtained a different result with a different seed?
# Step 1
set.seed(1)
training_test_sets <- construct_training_test(d_tree)
training_set <- training_test_sets[["training"]]
test_set <- training_test_sets[["test"]]
# Step 2
m1 <- lm(Volume ~ poly(Girth, degree = 1 , raw = TRUE), data = training_set)
m2 <- lm(Volume ~ poly(Girth, degree = 2 , raw = TRUE), data = training_set)
m3 <- lm(Volume ~ poly(Girth, degree = 3 , raw = TRUE), data = training_set)
m48 <- lm(Volume ~ poly(Girth, degree = 48, raw = TRUE), data = training_set)
# Step 3
compute_RMSE(test_set, m1)
[1] 6.589842
compute_RMSE(test_set, m2)
[1] 4.332965
compute_RMSE(test_set, m3)
[1] 5.437581
compute_RMSE(test_set, m48)
[1] 310105724
A drawback of the holdout method is that the test set matters a lot.
Repeating steps 2 and 3 with a different partition from step 1 may give different results.
Cross-validation (CV) is the natural generalization of the holdout method.
Repeat the following steps \(k\) times
Let fold 1 be the test set and the other folds be the training set
Fit the model to the training set (like the holdout method)
Evaluate the prediction accuracy of the model on the test set (like the holdout method)
Go back to 1, letting the next fold be the test set.
\(\Rightarrow\) Select the model with the best overall prediction accuracy in step 3.
5-fold cross-validation
Source: towardsdatascience
Reasoning behind CV: let each observation be in the test set once.
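The steps above can be sketched in a few lines of base R; a minimal version, assuming the trees data, \(k=5\) folds, and the RMSE as the accuracy measure:

```r
# Sketch: k-fold cross-validated RMSE for Volume ~ Girth on the trees data
set.seed(1)
d_tree <- datasets::trees
k      <- 5
fold   <- sample(rep(1:k, length.out = nrow(d_tree)))  # random fold labels

rmse_by_fold <- sapply(1:k, function(i) {
  training <- d_tree[fold != i, ]   # fit on the other k - 1 folds
  test     <- d_tree[fold == i, ]   # predict the held-out fold
  fit      <- lm(Volume ~ Girth, data = training)
  sqrt(mean((test$Volume - predict(fit, newdata = test))^2))
})
mean(rmse_by_fold)  # overall CV estimate of the RMSE
```

Averaging over the folds uses every observation in a test set exactly once, which is the point of CV.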
set.seed(345)
n_folds <- 10
county_2019_nc_folds <- county_2019_nc %>%
slice_sample(n = nrow(county_2019_nc)) %>%
mutate(fold = rep(1:n_folds, n_folds)) %>%
arrange(fold)
predict_folds <- function(i) {
fit <- lm(uninsured ~ hs_grad, data = county_2019_nc_folds %>% filter(fold != i))
predict(fit, newdata = county_2019_nc_folds %>% filter(fold == i)) %>%
bind_cols(county_2019_nc_folds %>% filter(fold == i), .fitted = .)
}
nc_fits <- map_df(1:n_folds, predict_folds)
p_nc_fits <- ggplot(nc_fits, aes(x = hs_grad, y = .fitted, group = fold)) +
geom_line(stat = "smooth", method = "lm", se = FALSE, size = 0.3, alpha = 0.5) +
scale_x_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
scale_y_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
labs(
x = "High school graduate", y = "Uninsured",
title = "Predicted uninsurance rate in NC",
subtitle = glue("For {n_folds} different testing datasets")
)
p_nc_fits
set.seed(123)
n_folds <- 10
county_2019_ny_folds <- county_2019_ny %>%
slice_sample(n = nrow(county_2019_ny)) %>%
mutate(fold = c(rep(1:n_folds, 6), 1, 2)) %>%
arrange(fold)
predict_folds <- function(i) {
fit <- lm(uninsured ~ hs_grad, data = county_2019_ny_folds %>% filter(fold != i))
predict(fit, newdata = county_2019_ny_folds %>% filter(fold == i)) %>%
bind_cols(county_2019_ny_folds %>% filter(fold == i), .fitted = .)
}
ny_fits <- map_df(1:n_folds, predict_folds)
p_ny_fits <- ggplot(ny_fits, aes(x = hs_grad, y = .fitted, group = fold)) +
geom_line(stat = "smooth", method = "lm", se = FALSE, size = 0.3, alpha = 0.5) +
scale_x_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
scale_y_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
labs(
x = "High school graduate", y = "Uninsured",
title = "Predicted uninsurance rate in NY",
subtitle = glue("For {n_folds} different testing datasets")
)
p_ny_fits
Set \(k=n\); that is, we use \(n\) folds, each of size \(1\). The test sets will therefore consist of a single observation and the training sets of \(n-1\) observations. This is known as leave-one-out cross-validation (LOOCV).
Not my favorite method,
but this is something you should learn because it is widely used.
Stepwise selection procedures are of two types: (i) forward and (ii) backward.
Choose a criterion that balances model fit (smaller SSR) and parsimony (small \(p\))
\(\Rightarrow\) adjusted-\(R^2\), AIC or BIC
Start with the empty model \(Y \approx \beta_0\), i.e. the model with no predictor. This is our current model.
Fit all possible models with one additional predictor, and let the best of these (smallest AIC) be the candidate model.
If the AIC of the candidate model is smaller (better) than the AIC of the current model, the candidate model becomes the current model, and we go back to step 3.
If the AIC of the candidate model is larger than the AIC of the current model (no new model improves on the current one), the procedure stops, and we select the current model.
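With AIC as the criterion, this procedure is built into R's step() function; a sketch on the trees data, where the scope formula lists the candidate predictors:

```r
# Forward selection with AIC, starting from the empty (intercept-only) model
d_tree <- datasets::trees
empty  <- lm(Volume ~ 1, data = d_tree)

m_forward <- step(empty,
                  scope = ~ Girth + Height,  # candidate predictors
                  direction = "forward",
                  trace = 0)                 # trace = 0 silences the log
formula(m_forward)
```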
Similar to forward selection, except that
Note that forward and backward selection need not agree; they may select different models.
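Backward selection, for comparison, starts from the full model and drops predictors as long as doing so improves the AIC; a sketch on the trees data:

```r
# Backward selection with AIC, starting from the full model
d_tree <- datasets::trees
full   <- lm(Volume ~ Girth + Height, data = d_tree)

m_backward <- step(full, direction = "backward", trace = 0)
formula(m_backward)
```

On these data the two directions happen to agree, but with many correlated predictors they can end at different models, as noted above.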